Abstract

Most currently used measures of inter-rater agreement for the
nominal case incorporate a correction for "chance agreement." The
definition of chance agreement is not the same for all coefficients,
however. Three chance-corrected coefficients are Cohen's κ,
Scott's π, and the S index of Bennett, Alpert, and Goldstein, which has
reappeared in many guises. For all three measures, chance is defined
to include independence between raters. Scott's π involves a further
assumption of homogeneous rater marginals under chance. For the S
coefficient, uniform marginals for both raters under chance are
assumed. Because of these disparate formulations, κ, π, and S can lead
to different conclusions about rater agreement. Consideration of the
properties of these measures leads to the recommendation that a test of
marginal homogeneity be conducted as a first step in the assessment of
rater agreement. Rejection of the hypothesis of homogeneity is
sufficient to conclude that agreement is poor. If the homogeneity
hypothesis is retained, π can be used as an index of agreement.

In educational and psychological research, it is frequently of
interest to assign subjects to nominal categories, such as
demographic groups, classroom behavior types, or psychodiagnostic
classifications. Because the reproducibility of the ratings is taken
to be an indicator of the quality of the category definitions and the
raters' ability to apply them, it is often required that the
classification task be performed by two raters. For k categories,
the results can be tabled in a k x k agreement matrix in which the
main diagonal contains the cases for which the raters agree.

A multitude of inter-rater agreement measures have been proposed
by researchers in the fields of statistics, biostatistics,
psychology, psychiatry, education, and sociology (see Landis & Koch
[1975a, 1975b, 1977] for useful reviews). This article focuses on
three coefficients that can be expressed in the form

\[ A = \frac{P_o - P_c(A)}{1 - P_c(A)}, \tag{1} \]

where

\[ P_o = \sum_{i=1}^{k} p_{ii} \]

is the observed proportion of agreement, p_ii is the
proportion of cases in the ith diagonal cell of the table, and P_c(A)
is the proportion of agreement expected by chance, as defined for
coefficient A. These coefficients represent an attempt to correct P_o
by subtracting from it the proportion of cases that fall on the
diagonal by "chance." The numerator is then divided by 1 - P_c(A), the
maximum non-chance agreement. (Note, however, that this maximum can
be achieved only if the two raters have identical marginals.
Otherwise, P_o cannot reach 1.00.) The resulting coefficient, A, is
assumed to provide a better description of the degree of inter-rater
agreement than the "raw" proportion of agreement, P_o.

One agreement index that can be expressed in the form of
Equation 1 is the S coefficient of Bennett, Alpert, and Goldstein
(1954), in which P_c(S) is defined as 1/k. This measure has
reappeared as the C coefficient of Janson and Vegelius (1979), the
κ_n index of Brennan and Prediger (1981) and, in the two-category
case, the G index of Guilford (1961; Holley & Guilford, 1964) and
the random error (RE) coefficient of Maxwell (1977). The
equivalence of these five coefficients, which has largely gone
unrecognized in the literature, is pointed out in the first part of
this article.

In the main portion of the article, the properties of S are
compared to those of two other coefficients that can be expressed
in the form of Equation 1: Scott's (1955) π coefficient and
Cohen's (1960) κ, currently the most popular index of rater
agreement for nominal categories. For convenience, the definitions
of P_c(A) associated with each coefficient are listed in Table 1A.
Some identities between coefficients are given in Table 1B.

In the final section of the paper, some recommendations are made
for assessing inter-rater agreement in the nominal case. In
particular, the need for examining the marginal distributions of the
raters is stressed. Although most of the article focuses on a
descriptive approach to the assessment of inter-rater agreement, an
inferential procedure for assessing marginal homogeneity is


Table 1

A. Definition of P_c(A) for κ, π, and S

Coefficient    Definition of P_c(A)
κ              $\sum_{i=1}^{k} p_{i+} p_{+i}$
π              $\sum_{i=1}^{k} [(p_{i+} + p_{+i})/2]^2$
S (G)          1/k

B. Identities Between Coefficients*

Condition                                   Identity
p_i+ = p_+i, i = 1, 2, ..., k               π = κ
k = 2, p_i+ = p_+i, i = 1, 2                π = κ = φ (the phi correlation)
p_i+ = p_+i = 1/k, i = 1, 2, ..., k         S = π = κ
k = 2, p_i+ = p_+i = 1/k, i = 1, 2          S = π = κ = G = φ

* In addition, the following identities hold by definition:
RE = G, C = κ_n = S.

presented, along with a proposed marginal homogeneity index.
Throughout the paper, a uniform notation system has been
substituted for the notation used in the original presentations.

The S Coefficient of Bennett, Alpert, and Goldstein
Bennett et al. (1954) sought to evaluate the degree of
agreement between two methods of obtaining information about
interviewees: a printed poll and a lengthy interview covering the
same general subject matter as the poll. They proposed the
following agreement coefficient:

\[ S = \frac{k}{k - 1}\left(P_o - \frac{1}{k}\right). \tag{2} \]

The rationale they offered is as follows: "The proportion 1/k
represents the best estimate of [P_o] expected on the basis of
chance ... The S score ... ranges from zero to unity as [P_o] ranges
from the value most probably expected on the basis of chance to
unity" (p. 307).

The RE and G Coefficients for 2 x 2 Tables
Maxwell (1977) proposed an index of inter-rater agreement for
2 x 2 tables, called the RE (random error) coefficient, that has
received some favorable attention in the literature (Carey &
Gottesman, 1978; Janes, 1979). Maxwell's model for the assignment
of subjects to categories can be outlined as follows: We assume
that if both raters are "without doubt" in categorizing a subject,
the raters must agree; if one or both raters are in doubt about a
case, they may either agree or disagree. Therefore, P_o is
spuriously inflated because it includes some doubtful cases. If a_1
and a_2 denote the proportions of "true" agreements (i.e., excluding
doubtful cases) for categories I and II, respectively, the
proportion of doubtful cases is [1 - (a_1 + a_2)]. If it is assumed
that these cases are allocated randomly to each of the four cells
of the table, the cell proportions will be as shown in Table 2.
If we then wish to obtain the quantity a_1 + a_2, the proportion of
agreement uncontaminated by doubtful cases, we proceed as follows:
\[ \begin{aligned}
a_1 + a_2 &= p_{11} + p_{22} - \tfrac{1}{2}\,[1 - (a_1 + a_2)] \\
          &= (p_{11} + p_{22}) - (p_{12} + p_{21}) \\
          &= P_o - P_d = RE,
\end{aligned} \tag{3} \]

where p_ij is the proportion of cases in the ith row and the jth
column and P_d = p_12 + p_21 is the proportion of disagreement.
Maxwell's RE coefficient is algebraically equivalent to G, a measure
of association for 2 x 2 tables proposed by Guilford (1961; Holley &
Guilford, 1964), which is a linear transformation of P_o:

\[ \begin{aligned}
G &= 2P_o - 1 \\
  &= P_o + (1 - P_d) - 1 \\
  &= P_o - P_d = RE.
\end{aligned} \tag{4} \]

Green (1981) developed a post hoc rationale for the G coefficient
that is very similar to Maxwell's development of RE.
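
As a quick numerical check of Equations 3 and 4, the sketch below builds the four cell proportions implied by Maxwell's model (Table 2) and confirms that P_o - P_d and 2P_o - 1 both recover a_1 + a_2. The function name `maxwell_cells` and the values a_1 = .5 and a_2 = .3 are illustrative assumptions, not taken from the original presentation.

```python
def maxwell_cells(a1, a2):
    """Cell proportions of the 2 x 2 table under Maxwell's model (Table 2).

    a1 and a2 are the proportions of "true" agreements; the remaining
    doubtful cases are spread evenly over the four cells.
    """
    doubtful = 1.0 - (a1 + a2)
    share = doubtful / 4.0
    return [[a1 + share, share],
            [share, a2 + share]]


# Hypothetical values chosen only for illustration.
a1, a2 = 0.5, 0.3
p = maxwell_cells(a1, a2)

p_o = p[0][0] + p[1][1]      # observed proportion of agreement
p_d = p[0][1] + p[1][0]      # observed proportion of disagreement
re_coef = p_o - p_d          # Maxwell's RE (Equation 3)
g_index = 2.0 * p_o - 1.0    # Guilford's G (Equation 4)

print(round(re_coef, 6), round(g_index, 6), round(a1 + a2, 6))
```

All three printed quantities coincide, which is the content of Equations 3 and 4 for this particular choice of a_1 and a_2.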

It is not difficult to generalize Maxwell's model to the case
of k > 2. If we let a_i (i = 1, 2, ..., k) represent the proportion

Table 2

Theoretical Cell Proportions for Maxwell's Model

                                    Rater 2
Rater 1    Category I                     Category II                    Total
I          a_1 + (1/4)[1 - (a_1 + a_2)]   (1/4)[1 - (a_1 + a_2)]         a_1 + (1/2)[1 - (a_1 + a_2)]
II         (1/4)[1 - (a_1 + a_2)]         a_2 + (1/4)[1 - (a_1 + a_2)]   a_2 + (1/2)[1 - (a_1 + a_2)]
Total      a_1 + (1/2)[1 - (a_1 + a_2)]   a_2 + (1/2)[1 - (a_1 + a_2)]   1.00

Note. a_1 and a_2 represent the proportions of "true" agreements for
categories I and II.






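of true agreement in the ith category, then

\[ P_o = \sum_{i=1}^{k} a_i + \frac{1}{k}\left(1 - \sum_{i=1}^{k} a_i\right). \tag{5} \]

If we let RE_k denote the generalized RE coefficient,

\[ \begin{aligned}
RE_k &= \sum_{i=1}^{k} a_i \\
     &= (k - 1)\left[\sum_{i=1}^{k} a_i \Big/ (k - 1)\right] \\
     &= \left[k \sum_{i=1}^{k} a_i + \left(1 - \sum_{i=1}^{k} a_i\right) - 1\right] \Big/ (k - 1).
\end{aligned} \]

From Equation 5, we can see that this is equal to

\[ RE_k = \frac{kP_o - 1}{k - 1} = \frac{k}{k - 1}\left(P_o - \frac{1}{k}\right) = S. \tag{6} \]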
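
The same identity can be checked numerically for k > 2. The sketch below assumes an arbitrary, purely illustrative set of true-agreement proportions a_i, computes P_o from Equation 5, and confirms that S from Equation 2 returns the sum of the a_i, as Equation 6 states. The helper names are hypothetical.

```python
def observed_agreement(a):
    """P_o under the generalized Maxwell model (Equation 5)."""
    k = len(a)
    total_true = sum(a)
    return total_true + (1.0 - total_true) / k


def s_from_agreement(p_o, k):
    """S of Bennett, Alpert, and Goldstein (Equation 2)."""
    return (k / (k - 1.0)) * (p_o - 1.0 / k)


# Hypothetical true-agreement proportions for k = 4 categories.
a = [0.30, 0.20, 0.10, 0.10]
p_o = observed_agreement(a)
s = s_from_agreement(p_o, len(a))

print(round(p_o, 4), round(s, 4), round(sum(a), 4))
```

The last two printed values agree, so S strips out exactly the share of P_o that the model attributes to randomly allocated doubtful cases.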



The C and κ_n Coefficients for k x k Tables
Janson and Vegelius (1979) proposed a coefficient, C, which is
identical to RE_k. Although C was described as a generalization of
the G index, its equivalence to S was not noted. Brennan and
Prediger (1981) presented a coefficient, κ_n, which, as they noted
(p. 693), is equivalent to S. (No mention was made of C, G, or RE.)
For reasons described further below, Brennan and Prediger
recommended that κ_n, rather than κ, be used in typical inter-rater
reliability studies.

Comparison of S, κ, and π

To simplify the discussion below, RE, G, C, and κ_n are all
referred to as S. As mentioned above, S, κ, and π can be expressed
in a common form (Equation 1), with the difference among them lying
in the definition of the proportion of agreement expected to occur
by chance. For each of the three coefficients, the formulation of
P_c(A) involves an assumption of independence of raters. That is,
P_c(A) is derived by multiplying, for each category, the hypothesized
values of the raters' marginal proportions under chance and then
summing these products over the k categories. In its most general
form, this sum can be expressed as

\[ P_c(A) = \sum_{i=1}^{k} h_{i+} h_{+i}, \tag{7} \]

where h_i+ is the hypothesized marginal proportion of cases assigned to
category i by rater 1 under chance and h_+i is the corresponding
proportion for rater 2. However, the three coefficients incorporate
differing assumptions about the marginal distributions of each rater
under chance, which, of course, are unobservable.

Let us now consider how each of the three agreement
coefficients defines the proportion of chance agreement. P_c(S)
is defined as 1/k. In this case, "chance" is understood to mean
that the two raters independently assign cases to categories in a
random fashion, each producing a uniform distribution; that is,
h_i+ = h_+i = 1/k, i = 1, 2, ..., k. Under these circumstances,
each cell in the agreement matrix is expected to contain 1/k^2
of the cases, and the total proportion of cases expected to fall
in the k diagonal cells is k(1/k^2) = 1/k. The assumption of random
assignment of cases to categories, however, seems unlikely to hold:
Even if both raters were ignorant of the rules to be used for
assigning cases to categories, their marginal distributions might
depart from uniformity because of a knowledge of the base rate (as in
the case of diagnosis), a desire to minimize false positives or
negatives with respect to a particular category, or a response set,
such as a tendency to avoid categories perceived as extreme. If the
unobservable marginal distributions departed from uniformity, the
term 1/k would be an inappropriate chance correction. Minimization
of the expression for P_c(A) in Equation 7, subject to the constraints
that

\[ \sum_{i=1}^{k} h_{i+} = \sum_{i=1}^{k} h_{+i} = 1.00 \]

and that the two raters' chance marginals coincide, shows that
min [P_c(A)] = 1/k. Therefore, under that condition, 1/k is a lower
bound to the proportion of agreement due to chance. It
can be shown algebraically that underestimation of P_c(A) leads to
inflated values of A.
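
Both statements can be verified in a few lines. The derivation below is a sketch that assumes the two raters share a common chance distribution, h_i+ = h_+i = h_i, which is the setting in which 1/k is the minimum; the symbol h_i is introduced here only for this illustration.

```latex
% Lower bound on chance agreement when both raters share marginals h_i
\begin{align*}
P_c(A) &= \sum_{i=1}^{k} h_i^2
        = \sum_{i=1}^{k}\left(h_i - \tfrac{1}{k}\right)^2 + \tfrac{1}{k}
        \;\ge\; \tfrac{1}{k},
        \qquad \text{since } \sum_{i=1}^{k} h_i = 1, \\[4pt]
% Effect of the chance term on the coefficient, holding P_o fixed
\frac{\partial A}{\partial P_c(A)}
       &= \frac{\partial}{\partial P_c(A)}
          \left[\frac{P_o - P_c(A)}{1 - P_c(A)}\right]
        = -\,\frac{1 - P_o}{\bigl(1 - P_c(A)\bigr)^2} \;\le\; 0 .
\end{align*}
```

Equality in the first line holds only for uniform marginals, and the second line shows that, whenever P_o is below 1, any understatement of P_c(A) raises A.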

A less fundamental problem with the use of the S coefficient
was noted by Scott (1955): For a fixed value of P_o, the value of S
increases as the number of categories, k, increases: "Given a
two-category sex dimension and a P_o of 60 percent, the S ... would
be 0.20. But a whimsical researcher might add two more categories,
'hermaphrodite' and 'indeterminant,' thereby increasing S to 0.47,
though the two additional categories are not used at all" (Scott,
1955, p. 322).
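
Scott's example is easy to reproduce. The short sketch below evaluates Equation 2 at a fixed P_o of .60 for two and for four categories; the function name is an illustrative choice.

```python
def s_coefficient(p_o, k):
    """S = k/(k - 1) * (P_o - 1/k), as in Equation 2."""
    return (k / (k - 1)) * (p_o - 1 / k)


p_o = 0.60
# Same observed agreement, evaluated with k = 2 and then k = 4 categories.
s_by_k = {k: round(s_coefficient(p_o, k), 2) for k in (2, 4)}
print(s_by_k)
```

The two rounded values match the 0.20 and 0.47 quoted from Scott, even though nothing about the raters' behavior has changed between the two evaluations.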

Scott's (1955) π coefficient was designed to overcome the
defects of S. It does not involve an unrealistic assumption of
random allocation under chance and does not become inflated by the
inclusion of non-functional categories. P_c(π) is defined as

\[ P_c(\pi) = \sum_{i=1}^{k} \left(\frac{p_{i+} + p_{+i}}{2}\right)^2, \]

where p_i+ and p_+i are the observed marginal
proportions for raters 1 and 2, respectively. Scott argued that
"it is convenient to assume that the distribution for the entire
set of interviews represents the most probable (and hence 'true' in
the long-run probability sense) distribution for any individual
coder" (Scott, 1955, p. 324). In computing π, then, we assume that
under chance, the raters would have identical marginals. We treat
the quantity (p_i+ + p_+i)/2 as the unobservable proportion of cases
assigned to category i by both raters under chance. In terms of
Equation 7, we let h_i+ = h_+i = (p_i+ + p_+i)/2.

The π index was criticized by Cohen, who remarked that "one
source of disagreement between a pair of judges is precisely their
proclivity to distribute their judgments differently over the
categories" (Cohen, 1960, p. 41). A similar objection was raised
by Fleiss (1975). Cohen (1960) recommended that κ, rather than π,
be used to assess rater agreement. P_c(κ) is defined as

\[ P_c(\kappa) = \sum_{i=1}^{k} p_{i+} p_{+i}. \]

Thus, "chance" in this context means independence
of raters 1 and 2, given the obtained marginals. In applying κ,
we make the assumption that each rater's distribution of cases to
categories under chance would be the same as his or her
observed distribution; that is, h_i+ = p_i+ and h_+i = p_+i. When
raters have the same marginals, π = κ (and, for k = 2, π = κ = φ,
the phi correlation). When, in addition, the marginals are uniform,
as in Case I, S = π = κ (for any k).

To further explore the properties of κ, it is useful to
examine, for fixed P_o, the effect of the rater marginals on the
size of the coefficients. Table 3 shows three cases, all of which
have P_o = .60. Let us first consider the situation, represented in
Cases I and II, in which the two raters have identical marginals.
In Case I, P_c(κ) = .25 and κ = .467, whereas in Case II, P_c(κ) =
.28 and κ = .444. κ is larger in Case I because, if both raters
have the same marginal distributions, P_c(κ) is minimized (and thus
κ maximized) when the marginal distributions are uniform. (This
property applies to π as well.) This property of κ and the
analogous property of the intraclass correlation in the ordinal
case were found objectionable by Whitehurst (1984), who regarded it
as a statistical artifact (see also Finn, 1970; Selvage, 1976). It
is not clear, however, that the relationship between the shape of
the marginal distributions and the size of κ is undesirable: If
cases are concentrated into a small number of categories, we cannot
determine whether our rating system includes decision criteria that
are adequate for discrimination among all k categories. Therefore,
it is not unreasonable that the value of an agreement coefficient
should be smaller in this situation than in the case of uniform
marginals.

Table 3

Values of κ, S, and π for Three Cases

Case I: Marginals uniform (κ = .467, S = .467, π = .467)

                        Rater 2
Rater 1       A       B       C       D     Total
A            .20     .00     .00     .05     .25
B            .00     .10     .15     .00     .25
C            .00     .15     .10     .00     .25
D            .05     .00     .00     .20     .25
Total        .25     .25     .25     .25    1.00

Case II: Marginals equal but not uniform (κ = .444, S = .467, π = .444)

                        Rater 2
Rater 1       A       B       C       D     Total
A            .20     .10     .10     .00     .40
B            .10     .10     .00     .00     .20
C            .10     .00     .10     .00     .20
D            .00     .00     .00     .20     .20
Total        .40     .20     .20     .20    1.00

Case III: Marginals unequal (κ = .474, S = .467, π = .460)

                        Rater 2
Rater 1       A       B       C       D     Total
A            .20     .05     .05     .10     .40
B            .00     .10     .05     .05     .20
C            .00     .05     .10     .05     .20
D            .00     .00     .00     .20     .20
Total        .20     .20     .20     .40    1.00
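
The coefficients in Table 3 can be recomputed directly from the cell proportions. The sketch below is a minimal illustration, not the author's procedure: it applies Equation 1 with the three definitions of P_c(A), and it assumes that cells left blank in the table are zero (the only values consistent with the printed marginals and with P_o = .60). The function name `agreement_coefficients` is hypothetical.

```python
def agreement_coefficients(table):
    """Return a dict with kappa, S, and pi for a square table of proportions.

    Rows index rater 1 and columns index rater 2.
    """
    k = len(table)
    rows = [sum(r) for r in table]                        # p_i+
    cols = [sum(r[j] for r in table) for j in range(k)]   # p_+i
    p_o = sum(table[i][i] for i in range(k))              # observed agreement

    # Chance agreement under each coefficient's definition.
    chance = {
        "kappa": sum(r * c for r, c in zip(rows, cols)),
        "S": 1.0 / k,
        "pi": sum(((r + c) / 2.0) ** 2 for r, c in zip(rows, cols)),
    }
    return {name: (p_o - p_c) / (1.0 - p_c) for name, p_c in chance.items()}


# Cell proportions as read from Table 3 (blank cells taken to be zero).
cases = {
    "I": [[.20, .00, .00, .05],
          [.00, .10, .15, .00],
          [.00, .15, .10, .00],
          [.05, .00, .00, .20]],
    "II": [[.20, .10, .10, .00],
           [.10, .10, .00, .00],
           [.10, .00, .10, .00],
           [.00, .00, .00, .20]],
    "III": [[.20, .05, .05, .10],
            [.00, .10, .05, .05],
            [.00, .05, .10, .05],
            [.00, .00, .00, .20]],
}

results = {name: agreement_coefficients(t) for name, t in cases.items()}
for name, coefs in results.items():
    print(name, {key: round(val, 3) for key, val in coefs.items()})
```

Under these assumptions the sketch reproduces the tabled values of κ and S for all three cases; for Case III it gives π of about .459, which the table reports as .460.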

But let us consider another factor that affects the size of κ:
the degree to which raters agree in their marginal distributions.
In both Cases II and III of Table 3, P_o = .60. In Case II, where
the raters have identical marginals, P_c(κ) = .28 and κ = .444. In
Case III, however, where the raters have different marginals, P_c(κ)
= .24 and κ = .474. Thus the raters in Case II are penalized for
producing identical marginals. This phenomenon results from a
property of κ pointed out by Brennan and Prediger (1981). In
computing P_c(κ), the marginal distributions associated with each
rater are, in a sense, regarded as prior, despite the fact that
they are, in themselves, evidence of the degree to which the raters
agree. As Brennan and Prediger (1981) stated, "two judges who
independently, and with no a priori knowledge, produce similar
marginal distributions must obtain a much higher agreement rate to
obtain a given value of kappa, than two judges who produce
radically different marginals" (p. 692). This is certainly an
undesirable property. Because there are ordinarily no external
restrictions on the marginals, there appears to be no justification
for treating marginal discrepancies as an obstacle which raters
should be credited for overcoming.

Recommendations

It appears that S, π, and κ all have major drawbacks. S
requires the assumption of random assignment of cases to categories
under chance, π fails to take into account the differences between
raters' marginals, and κ gives credit, for fixed P_o, to raters who
produce different marginals. How, then, should inter-rater
agreement be assessed? The answer lies in the examination of the
degree of marginal agreement or homogeneity per se. Rather than
correcting for marginal disagreement, we should be studying it to
determine whether we believe it reflects important rater
differences or merely random error. The absence of discussion of
this issue in the educational and psychological literature on
chance-corrected agreement is striking. (Fleiss, 1965, is an
exception, but only the dichotomous case is discussed.)

It is proposed here that the assessment of rater agreement
should consist of two phases: (a) the investigation of marginal
homogeneity and (b) if marginal homogeneity holds, the computation
of Scott's π as a measure of chance-corrected agreement. The
rationale for this approach is as follows. If we reject the
hypothesis of marginal homogeneity, we need go no further: We have
sufficient information to conclude that agreement is unsatisfactory.
On the other hand, if marginal differences are small, it is reasonable
to apply Scott's π, thus averaging out unimportant marginal
differences in computing P_c. If marginal differences are small, the
value of κ will, in any case, be close to that of π; the choice
between them is therefore no longer important.

How can we assess marginal homogeneity? If we have a fairly
large random sample, we can make use of Stuart's (1955) test. The
hypothesis of interest is H_0: π_i+ = π_+i, where π_i+ is the k x 1
vector of elements π_i+, which represent the marginal probability
of being in row i (corresponding to rater 1), and π_+i is the
corresponding vector of column probabilities (corresponding to
rater 2). The test statistic is

\[ \chi^2 = (\mathbf{p}_{i+} - \mathbf{p}_{+i})' \, \mathbf{V}^{-1} \, (\mathbf{p}_{i+} - \mathbf{p}_{+i}), \tag{8} \]

where (p_i+ - p_+i) is the (k - 1) x 1 vector of differences
between the ith row marginal proportion and the ith column marginal
proportion for the first k - 1 categories. (The kth difference is
determined by the others, because each set of marginal proportions
sums to 1.) V is the (k - 1) x (k - 1) variance-covariance matrix
of the random vector (p_i+ - p_+i), defined under H_0, with diagonal
elements

\[ v_{ii} = \frac{p_{i+} + p_{+i} - 2p_{ii}}{n} \tag{9} \]

and off-diagonal elements

\[ v_{ij} = -\,\frac{p_{ij} + p_{ji}}{n}, \tag{10} \]

where n is the sample size. The test statistic is asymptotically
distributed as χ² with k - 1 degrees of freedom under H_0. (When
there are k = 2 categories, Stuart's test reduces to the McNemar
test.)

As an example, consider Case III of Table 3, assuming n = 100.
Then

\[ (\mathbf{p}_{i+} - \mathbf{p}_{+i})' = [(.4 - .2),\ (.2 - .2),\ (.2 - .2)] \]

and

\[ \mathbf{V} = \frac{1}{100}
\begin{bmatrix}
.4 + .2 - 2(.2) & -(.05 + 0)      & -(.05 + 0) \\
-(.05 + 0)      & .2 + .2 - 2(.1) & -(.05 + .05) \\
-(.05 + 0)      & -(.05 + .05)    & .2 + .2 - 2(.1)
\end{bmatrix}. \]

We find that χ² = 21.82 is larger than χ²(3, .95) = 7.81. Therefore, the
null hypothesis of marginal homogeneity is rejected at α = .05 and no
further investigation is needed in order to conclude that rater
agreement is inadequate.
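
For readers who wish to carry out the computation, the following is a minimal pure-Python sketch of Equations 8 through 10. It is an illustration under stated assumptions rather than a reference implementation: the function name `stuart_statistic` is hypothetical, V is assumed to be nonsingular, and the Case III cell proportions are entered with blank cells of Table 3 taken as zero. Run on Case III with n = 100, the sketch returns approximately 26.67 rather than the 21.82 reported in the text; 21.82 is the value that results when the off-diagonal elements of V are entered with positive signs. Both figures exceed the critical value, so the decision to reject is the same either way.

```python
def stuart_statistic(table, n):
    """Stuart's chi-square for marginal homogeneity (Equations 8-10).

    table holds joint proportions (rows: rater 1, columns: rater 2);
    n is the sample size. The last category is dropped, as in the text.
    Assumes the variance-covariance matrix V is nonsingular.
    """
    k = len(table)
    m = k - 1
    rows = [sum(r) for r in table]                        # p_i+
    cols = [sum(r[j] for r in table) for j in range(k)]   # p_+i
    d = [rows[i] - cols[i] for i in range(m)]             # marginal differences

    # Variance-covariance matrix under H0 (Equations 9 and 10).
    v = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if i == j:
                v[i][j] = (rows[i] + cols[i] - 2.0 * table[i][i]) / n
            else:
                v[i][j] = -(table[i][j] + table[j][i]) / n

    # Solve V x = d by Gaussian elimination with partial pivoting,
    # then form the quadratic form d' V^{-1} d = d' x.
    aug = [v[i][:] + [d[i]] for i in range(m)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, m))
        x[i] = (aug[i][m] - tail) / aug[i][i]
    return sum(di * xi for di, xi in zip(d, x))


# Case III of Table 3 (blank cells taken to be zero).
case_iii = [[.20, .05, .05, .10],
            [.00, .10, .05, .05],
            [.00, .05, .10, .05],
            [.00, .00, .00, .20]]

chi_square = stuart_statistic(case_iii, n=100)
critical_value = 7.815   # chi-square with 3 df, upper .05 point
print(round(chi_square, 2), chi_square > critical_value)
```

For a 2 x 2 table the same function reduces to the McNemar statistic, which offers a simple way to check an implementation against hand calculation.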

It is also possible to formulate an index of marginal agreement,
based on Stuart's test, as follows:

\[ M = 1 - \frac{\chi^2}{n}. \]

It can be shown that max (χ²) = n, the sample size. (This maximum
occurs when one rater assigns all objects to a single category and
the other rater assigns all objects to a different category.)
Therefore, the proposed index takes on a value of zero under maximal
marginal disagreement and a value of one when the marginals are
identical. For the example above,

\[ M = 1 - \frac{21.82}{100} = .78. \]

Note that for a given table of observed proportions (e.g., Case III
of Table 3), the value of M will be the same, regardless of sample
size.
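
The invariance follows directly from Equations 8 through 10. As a brief sketch, write V = W/n, where W (a symbol introduced here only for illustration) collects the numerators of Equations 9 and 10 and so depends only on the observed proportions, and let d abbreviate the vector of marginal differences.

```latex
\begin{align*}
\chi^2 &= \mathbf{d}'\,\mathbf{V}^{-1}\,\mathbf{d}
        = \mathbf{d}'\left(\tfrac{1}{n}\mathbf{W}\right)^{-1}\mathbf{d}
        = n\,\mathbf{d}'\,\mathbf{W}^{-1}\,\mathbf{d}, \\
M &= 1 - \frac{\chi^2}{n} = 1 - \mathbf{d}'\,\mathbf{W}^{-1}\,\mathbf{d}.
\end{align*}
```

Because n cancels, M depends only on the table of proportions, whereas the test statistic itself grows in proportion to n.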

To determine which categories are the source of rater 
disagreements, the post hoc procedures for Stuart's test, described 
by Marascuilo and McSweeney (1977) and Zwick, Neuhoff, Marascuilo,
and Levin (1982) can be applied. In fact, because these procedures 
do not involve matrix inversion, the researcher may want to perform 
only the category-by-category comparisons and bypass the overall tests. 

Although they have been ignored in education and psychology, 
tests of marginal homogeneity have been applied in this context by 
biostatisticians, such as Landis and Koch (1977). The test they 
illustrate, which can be formulated in terms of the GSK (Grizzle,
Starmer, & Koch, 1969) approach to the analysis of categorical 
data, is essentially the same as Stuart's test. (The difference 
lies in the formulation of V. In Stuart's test, V is computed 
under the assumption that Hq is true. This restriction is not 
imposed in the GSK approach.) 

In Cases I and II, it is obvious that the hypothesis of marginal
homogeneity would be retained. We could then use π as a
chance-corrected measure of agreement. π is always less than or equal
to κ; the equality holds when the rater marginals are identical.
For fixed values of (p_i+ + p_+i)/2, π does not give credit, as does κ,
for marginal discrepancies between raters. Cohen's objection to π
(that it ignores differences in rater marginals) is no longer an
issue if π is applied only when the marginal homogeneity hypothesis
is retained. It is possible to test π for significance as well,
although the standard error provided by Scott (1955) is not correct.
One possible approach to hypothesis testing is given by Hubert (1977,
pp. 293-294), who uses a matching model to derive the expected value
and variance of a statistic equivalent to π.